- name: kubernetes.node
  rules:
  - alert: CPUStealHigh
    expr: max by (node) (irate(node_cpu_seconds_total{mode="steal"}[30m]) * 100) > 10
    for: 30m
    labels:
      severity_level: "4"
      tier: cluster
    annotations:
      plk_protocol_version: "1"
      plk_create_group_if_not_exists__cluster_has_node_alerts: "ClusterHasNodeAlerts,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      plk_grouped_by__cluster_has_node_alerts: "ClusterHasNodeAlerts,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      description: |-
        The CPU steal is too high on the {{ $labels.node }} Node in the last 30 minutes.

        Probably, some other component is stealing Node resources (e.g., a neighboring virtual machine). This may be the result of "overselling" the hypervisor. In other words, there are more virtual machines than the hypervisor can handle.
      summary: CPU Steal on the {{ $labels.node }} Node is too high.

  - alert: NodeSystemExporterDoesNotExistsForNode
    expr: sum by (node) (kubernetes_build_info{job="kubelet"}) unless (sum by (node) (up{node=~".+", job="kubelet"}) and sum by (node) (up{node=~".+", job="node-exporter"}))
    for: 5m
    labels:
      severity_level: "4"
      tier: cluster
    annotations:
      plk_protocol_version: "1"
      plk_markup_format: markdown
      plk_create_group_if_not_exists__cluster_has_node_alerts: "ClusterHasNodeAlerts,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      plk_grouped_by__cluster_has_node_alerts: "ClusterHasNodeAlerts,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      description: |-
        Some of the Node system exporters don't work correctly for the {{ $labels.node }} Node.

        The recommended course of action:
        1. Find the Node exporter Pod for this Node: `kubectl -n d8-monitoring get pod -l app=node-exporter -o json | jq -r ".items[] | select(.spec.nodeName==\"{{$labels.node}}\") | .metadata.name"`;
        2. Describe the Node exporter Pod: `kubectl -n d8-monitoring describe pod <pod_name>`;
        3. Check that kubelet is running on the {{ $labels.node }} node.

  - alert: NodeConntrackTableFull
    expr: max by (node) ( node_nf_conntrack_entries / node_nf_conntrack_entries_limit * 100 > 70 )
    for: 5m
    labels:
      severity_level: "4"
      tier: cluster
    annotations:
      plk_protocol_version: "1"
      plk_markup_format: markdown
      plk_create_group_if_not_exists__cluster_has_node_alerts: "ClusterHasNodeAlerts,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      plk_grouped_by__cluster_has_node_alerts: "ClusterHasNodeAlerts,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      description: |-
        The conntrack table on the {{ $labels.node }} is {{ $value }}% of the maximum size.

        There's nothing to worry about yet if the `conntrack` table is only 70-80 percent full. However, if it runs out, you will experience problems with new connections while the software will behave strangely.

        The recommended course of action is to identify the source of "excess" `conntrack` entries using Okmeter or Grafana charts.
      summary: The `conntrack` table is close to the maximum size.

  - alert: NodeConntrackTableFull
    expr: max by (node) ( node_nf_conntrack_entries / node_nf_conntrack_entries_limit * 100 > 95 )
    for: 1m
    labels:
      severity_level: "3"
      tier: cluster
    annotations:
      plk_protocol_version: "1"
      plk_markup_format: markdown
      plk_create_group_if_not_exists__cluster_has_node_alerts: "ClusterHasNodeAlerts,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      plk_grouped_by__cluster_has_node_alerts: "ClusterHasNodeAlerts,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      description: |-
        The `conntrack` table on the {{ $labels.node }} Node is full!

        No new connections are created or accepted on the Node; note that this may result in strange software issues.

        The recommended course of action is to identify the source of "excess" `conntrack` entries using Okmeter or Grafana charts.
      summary: The `conntrack` table is full.

  - alert: NodeUnschedulable
    expr: max by (node) (kube_node_spec_unschedulable) == 1
    for: 20m
    labels:
      severity_level: "8"
      tier: cluster
    annotations:
      plk_markup_format: "markdown"
      plk_protocol_version: "1"
      plk_create_group_if_not_exists__cluster_has_node_alerts: "ClusterHasNodeAlerts,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      plk_grouped_by__cluster_has_node_alerts: "ClusterHasNodeAlerts,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      summary: The {{ $labels.node }} Node is cordon-protected; no new Pods can be scheduled onto it.
      description: |-
        The {{ $labels.node }} Node is cordon-protected; no new Pods can be scheduled onto it.

        This means that someone has executed one of the following commands on that Node:
        - `kubectl cordon {{ $labels.node }}`
        - `kubectl drain {{ $labels.node }}` that runs for more than 20 minutes

        Probably, this is due to the maintenance of this Node.

  - alert: NodeSUnreclaimBytesUsageHigh
    expr: |
      max by (node) (predict_linear(node_memory_SUnreclaim_bytes[2h], 43200) > node_memory_MemTotal_bytes * 0.25)
      and
      max by (node) (node_memory_SUnreclaim_bytes > node_memory_MemTotal_bytes * 0.1)
    for: 20m
    labels:
      severity_level: "4"
      tier: cluster
    annotations:
      plk_markup_format: "markdown"
      plk_protocol_version: "1"
      plk_create_group_if_not_exists__cluster_has_node_alerts: "ClusterHasNodeAlerts,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      plk_grouped_by__cluster_has_node_alerts: "ClusterHasNodeAlerts,tier=cluster,prometheus=deckhouse,kubernetes=~kubernetes"
      summary: The {{ $labels.node }} Node has high kernel memory usage.
      description: |-
        The {{ $labels.node }} Node has potential kernel memory leak. There is one known issue that can be reason for it.

        You should check cgroupDriver on the {{ $labels.node }} Node:
        - `cat /var/lib/kubelet/config.yaml | grep 'cgroupDriver: systemd'`

        If cgroupDriver is set to systemd then reboot is required to roll back to cgroupfs driver. Please, drain and reboot the node.

        You can check this [issue](https://github.com/deckhouse/deckhouse/issues/2152) for extra information.
